Exploratory Data Analysis

There are 3,400 Census dissemination areas in Metro Vancouver, ranging from 0.01 to 145 km² in land area. Together they account for over 955,000 households and 2,450,000 people.

Maps of Key Housing Variables by Dissemination Area

We extracted several key housing-related variables by dissemination area, including:

* Median Shelter Value
* Median Total Household Income
* Average Household Size
* Population Density

You can examine these variables in the series of interactive maps below. Mouse over a map to see the values of the variables for each area.

 

While the interactive maps give a general impression of how housing value, income, household size and population density vary across the Metro Vancouver region, it is difficult to draw correlations between housing value and the other variables from maps alone. Let’s look at this more closely in the next section of the EDA.

Relationships between key variables and Housing Value

The relationship between population density and housing value is nonlinear. Areas with higher population density may have smaller, less expensive housing units.

The relationship between housing value and total population in the area seems to be inversely proportional.

Housing value generally increases as the median household income in the dissemination area increases.

A larger average household size also correlates with higher housing value. This may be due to the need for more space and a bigger house.
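The inverse relationships described above are often encoded as reciprocal predictors before fitting a linear model. The sketch below shows one possible way to build such variables. The raw column names `pop_tot`, `pop_density` and `emp_tot`, the toy values, and the helper `safe_inverse()` are assumptions made for illustration; the source does not show how its `inv_*` variables were actually derived.

```r
# Toy data frame standing in for the census table. The column names
# pop_tot, pop_density and emp_tot are assumed, not taken from the source.
census_toy <- data.frame(
  pop_tot     = c(450, 620, 0, 1100),
  pop_density = c(1200.5, 5400.0, 0, 9800.2),
  emp_tot     = c(210, 330, 0, 640)
)

# Hypothetical helper: reciprocal of x, with zero or missing values mapped
# to NA so that lm() drops those rows instead of receiving Inf.
safe_inverse <- function(x) {
  ifelse(is.na(x) | x == 0, NA_real_, 1 / x)
}

census_toy$inv_pop_tot     <- safe_inverse(census_toy$pop_tot)
census_toy$inv_pop_density <- safe_inverse(census_toy$pop_density)
census_toy$inv_emp_tot     <- safe_inverse(census_toy$emp_tot)

print(census_toy)
```

Mapping zeros to NA is a design choice: `lm()` omits rows with missing values by default, whereas an infinite predictor value would cause the fit to fail.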

A Simple Regression Model

Model 1

housing_value_model <-
  lm(shelter_val_med ~ income_hh_med + hhsize_avg + inv_pop_density + inv_emp_tot,
     data = census_data_rev)
summary(housing_value_model, correlation = TRUE)
## 
## Call:
## lm(formula = shelter_val_med ~ income_hh_med + hhsize_avg + inv_pop_density + 
##     inv_emp_tot, data = census_data_rev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1468255  -374222  -117492   251981  3597853 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.285e+05  5.185e+04  -8.264  < 2e-16 ***
## income_hh_med    9.980e+00  4.674e-01  21.353  < 2e-16 ***
## hhsize_avg       8.240e+04  1.958e+04   4.208 2.64e-05 ***
## inv_pop_density -3.461e+06  2.335e+06  -1.482    0.138    
## inv_emp_tot      1.300e+08  6.640e+06  19.582  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 590700 on 3298 degrees of freedom
##   (98 observations deleted due to missingness)
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2846 
## F-statistic: 329.5 on 4 and 3298 DF,  p-value: < 2.2e-16
## 
## Correlation of Coefficients:
##                 (Intercept) income_hh_med hhsize_avg inv_pop_density
## income_hh_med   -0.21                                               
## hhsize_avg      -0.65       -0.47                                   
## inv_pop_density  0.01       -0.09          0.02                     
## inv_emp_tot     -0.29       -0.11         -0.06      -0.01
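
Coefficients on inverse terms are harder to read than those on raw variables, because the effect of a one-unit change depends on the starting level. The sketch below works through the arithmetic for the `inv_emp_tot` estimate reported above. It assumes that `inv_emp_tot` is simply 1 divided by an employment count, which the variable name suggests but the source does not state; the counts 100, 200, 1000 and 1100 and the helper `inverse_term_change()` are illustrative only.

```r
# Coefficient on inv_emp_tot, copied from the Model 1 output above.
beta_inv_emp <- 1.300e+08

# Hypothetical helper: change in the fitted value when the underlying count
# moves from `from` to `to`, holding the other predictors fixed.
# Assumes the predictor is defined as 1 / count.
inverse_term_change <- function(beta, from, to) {
  beta * (1 / to - 1 / from)
}

# 1.3e8 * (1/200 - 1/100) = 1.3e8 * (-0.005) = -650000
change_100_to_200 <- inverse_term_change(beta_inv_emp, from = 100, to = 200)

# The same absolute increase of 100, starting from a larger count.
change_1000_to_1100 <- inverse_term_change(beta_inv_emp, from = 1000, to = 1100)

print(c(change_100_to_200, change_1000_to_1100))
```

Under that assumption, the same absolute increase in the count shifts the fitted value far less when the count is already large.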

Model 2

housing_value_model <-
  lm(shelter_val_med ~ income_hh_med + hhsize_avg + inv_pop_tot + inv_emp_tot,
     data = census_data_rev)
summary(housing_value_model, correlation = TRUE)
## 
## Call:
## lm(formula = shelter_val_med ~ income_hh_med + hhsize_avg + inv_pop_tot + 
##     inv_emp_tot, data = census_data_rev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2387131  -374891  -122229   263875  3597562 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.562e+05  5.382e+04  -6.618 4.24e-11 ***
## income_hh_med  1.034e+01  4.724e-01  21.889  < 2e-16 ***
## hhsize_avg     7.469e+04  1.959e+04   3.812 0.000141 ***
## inv_pop_tot   -1.370e+08  2.875e+07  -4.766 1.96e-06 ***
## inv_emp_tot    1.730e+08  1.120e+07  15.442  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 588900 on 3298 degrees of freedom
##   (98 observations deleted due to missingness)
## Multiple R-squared:  0.2899, Adjusted R-squared:  0.2891 
## F-statistic: 336.6 on 4 and 3298 DF,  p-value: < 2.2e-16
## 
## Correlation of Coefficients:
##               (Intercept) income_hh_med hhsize_avg inv_pop_tot
## income_hh_med -0.15                                           
## hhsize_avg    -0.65       -0.48                               
## inv_pop_tot   -0.28       -0.19          0.09                 
## inv_emp_tot    0.06        0.09         -0.11      -0.81
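
One way to weigh the two specifications is to put their fit statistics side by side and check how strongly the inverse predictors move together. The sketch below does this on simulated data, because `census_data_rev` is not available here. Every value in `toy` is synthetic, the object names `model_1`, `model_2` and `comparison` are introduced for illustration, and the numbers it prints say nothing about the real census data.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for census_data_rev; all values are made up.
toy <- data.frame(
  income_hh_med = runif(n, 40000, 150000),
  hhsize_avg    = runif(n, 1.5, 4),
  pop_tot       = round(runif(n, 200, 2000))
)
toy$pop_density <- toy$pop_tot / runif(n, 0.05, 2)   # population / made-up land area
toy$emp_tot     <- round(toy$pop_tot * runif(n, 0.3, 0.7))

toy$inv_pop_density <- 1 / toy$pop_density
toy$inv_pop_tot     <- 1 / toy$pop_tot
toy$inv_emp_tot     <- 1 / toy$emp_tot

toy$shelter_val_med <- 10 * toy$income_hh_med + 80000 * toy$hhsize_avg +
  rnorm(n, sd = 2e5)

# Same formulas as Model 1 and Model 2, stored under separate names so that
# the second fit does not overwrite the first.
model_1 <- lm(shelter_val_med ~ income_hh_med + hhsize_avg + inv_pop_density + inv_emp_tot,
              data = toy)
model_2 <- lm(shelter_val_med ~ income_hh_med + hhsize_avg + inv_pop_tot + inv_emp_tot,
              data = toy)

# Both models use the same response and the same rows, so AIC is comparable.
comparison <- data.frame(
  model         = c("Model 1", "Model 2"),
  adj_r_squared = c(summary(model_1)$adj.r.squared, summary(model_2)$adj.r.squared),
  aic           = c(AIC(model_1), AIC(model_2))
)
print(comparison)

# Correlation between the two inverse-count predictors used in Model 2.
predictor_cor <- cor(toy$inv_pop_tot, toy$inv_emp_tot)
print(predictor_cor)
```

A strong correlation between two predictors inflates the standard errors of both coefficients. This is worth checking on the real data, given the -0.81 correlation between the `inv_pop_tot` and `inv_emp_tot` estimates in the Model 2 output.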

What do you think of these two models? What other variables do you think we should be looking at?